Abstract

We present a distributed data structure, which we call the rainbow skip graph. To our 
knowledge, this is the first peer-to-peer data structure that simultaneously achieves high fault- 
tolerance, constant-sized nodes, and fast update and query times for ordered data. It is a 
non-trivial adaptation of the SkipNet/skip-graph structures of Harvey et al. and Aspnes and 
Shah, so as to provide fault-tolerance as these structures do, but to do so using constant-sized 
nodes, as in the family tree structure of Zatloukal and Harvey. It supports successor queries 
on a set of n items using O(log n) messages with high probability, an improvement over the
expected O(log n) messages of the family tree. Our structure achieves these results by using the
following new constructs: 

• Rainbow connections: parallel sets of pointers between related components of nodes, so as 
to achieve good connectivity between "adjacent" components, using constant-sized nodes. 

• Hydra components: highly-connected, highly fault-tolerant components of constant-sized 
nodes, which will contain relatively large connected subcomponents even under the failure 
of a constant fraction of the nodes in the component. 

We further augment the hydra components in the rainbow skip graph by using erasure-resilient 
codes to ensure that any large subcomponent of nodes in a hydra component is sufficient to 
reconstruct all the data stored in that component. By carefully maintaining the size of related 
components and hydra components to be O(log n), we are able to achieve fast times for updates
and queries in the rainbow skip graph. In addition, we show how to make the communication 
complexity for updates and queries be worst case, at the expense of more conceptual complexity 
and a slight degradation in the node congestion of the data structure. 

Categories and Subject Descriptors: E.2 Data Storage Representations 

General Terms: Algorithms, Design. 

Keywords: Distributed data structures, peer-to-peer networks, skip lists, skip graphs, family 
trees, erasure codes. 



1 Introduction 

Distributed peer-to-peer networks present a decentralized, distributed method of storing large data 
sets. Information is stored at the hosts in such a network and queries are performed by sending 
messages between hosts (sometimes iteratively with the query issuer), so as to ultimately identify
the host(s) that store(s) the requested information. For the sake of efficiency, we desire that the 
assignment and indexing of data at the nodes of such a network be done to facilitate the following 
outcomes: 

• Small nodes: Each node in the structure should be small. Ideally, each node should have 
constant size, including all of its pointers (which are pairs (x, a), where x is a host node and 
a is an address on that node). This property allows for efficient space usage, even when many 
virtual nodes are aggregated into single physical hosts. (We make the simplifying assumption 
in this paper that there is a one-to-one correspondence between hosts and nodes, since a 
blocking strategy, such as that done in the skip-webs framework of Arge et al. [2], can be used
to assign virtual nodes to physical hosts.) 

• Fault tolerance: The structure should adjust to the failure of some nodes, repairing the 
structure at small cost in such cases. Ideally, we should be able to recover the indexing data 
from failed nodes, so as to be able to answer queries with confidence. 

• Fast queries and updates: The structure should support fast queries and insertions/deletions, 
in terms of the number of rounds of communication and number of messages that must 
be exchanged in order to complete requested operations. (We are not counting the internal 
computation time at hosts or the query/update issuer, as we expect that, in typical scenarios, 
message delays will be the efficiency bottleneck.) 

• Support for ordered data: The structure should support queries that are based on an ordering 
of the data, such as nearest-neighbor searches and range queries. This feature allows for 
a richer set of queries than a simple dictionary that can only answer membership queries, 
including those arising in DNA databases, location-based services, and prefix searches for file 
names or data titles. 

To help quantify the above desired features, we use the following parameters, with respect to a 
distributed data structure storing a set S of n items:

• M: the memory size of a host, which is the number of data items (keys), data structure 
nodes, pointers, and host IDs that any host can store. 

• Q(n): the query cost, i.e., the number of messages needed to process a query on S.

• U(n): the update cost, i.e., the number of messages needed to insert a new item in the set S or
remove an item from the set S.

• C(n): the congestion per host, i.e., the maximum (taken over all nodes) of the expected fraction
of n random queries that visit any given node in the structure (so the congestion of a single
distributed binary search tree is Θ(1) and the congestion of n complete copies of the data set
is Θ(1/n)).

We assume that each host has a reference to the place where any search from that host should 
begin, i.e., a starting node for that host. 
Figure 1: A skip list. Each element exists in the bottom-level list, and each node on one level is
copied to the next higher level with probability 1/2. A search can start at any node and proceed 
up to the top level (moving left or right if a node is not copied higher), and then down to the 
bottom level. In the downward phase, we search for the query key on a given level and then move 
down to the next level, and continue searching until we reach the desired node on the bottom. The 
expected query time is O(log n) and the expected space is O(n).

1.1 Previous Related Work 

There is a significant and growing literature on distributed peer-to-peer data structures. For
example, there is a considerable amount of work on variants of Distributed Hash Tables (DHTs),
including Chord [20], Koorde [13], Pastry [18], Scribe [19], Symphony [14], and Tapestry [22], to
name just a few. Although they have excellent congestion properties, these structures do not allow
for non-trivial queries on ordered data, such as nearest-neighbor searching, string prefix searching,
or range queries. Aspnes and Shah [4] present a distributed data structure, called skip graphs, for
searching ordered data in a peer-to-peer network, based on the randomized skip-list data
structure [17]. (See Figure 1.) Harvey et al. [12] independently present a similar structure, which they
call SkipNet. These structures achieve O(log n / n) congestion, expected O(log n) query time, and
expected O(log n) update times, using n hosts, each of size O(log n). Harvey and Munro [11] present
a deterministic version of SkipNet, showing how to achieve worst-case O(log n) query times, albeit
with increased update costs, which are O(log^2 n), and higher congestion, which is O(log n / n^{0.68}).
Zatloukal and Harvey [21] show how to modify SkipNet to construct a structure they call family
trees, achieving O(log n) expected time for search and update, while restricting M to be O(1),
which is optimal. Manku, Naor, and Wieder [15] show how to improve the expected query cost for
searching skip graphs and SkipNet to O(log n / log log n) by having hosts store the pointers from
their neighbors to their neighbors' neighbors (i.e., neighbors-of-neighbors (NoN) tables); see also
Naor and Wieder [16]. Unfortunately, this improvement requires that the memory size and
expected update time grow to be O(log^2 n), with a similar degradation in congestion, to O(log^2 n / n).
Focusing instead on fault tolerance, Awerbuch and Scheideler [5] show how to combine a skip
graph/SkipNet data structure with a DHT to achieve improved general fault tolerance for such
structures, but at the expense of a logarithmic-factor slowdown for queries and updates. Aspnes et
al. [3] show how to trade off the space complexity of the skip graph structure with its congestion,
by bucketing intervals of keys on the "bottom level" of their structure. Their method reduces the
overall space usage to O(n), but increases the congestion to O(log^2 n / n) and still requires that M
be O(log n). Arge et al. [2] present a framework, called skip-webs, which generalizes the skip graph
data structure to higher dimensions and achieves O(log n / log log n) query times, albeit with M
being O(log n) rather than constant-sized.

Method                         M            Q(n)                   U(n)                     C(n)
---------------------------------------------------------------------------------------------------------
skip graphs/SkipNet [4, 12]    O(log n)     O(log n) w.h.p.        O(log n) w.h.p.          O(log n / n)
NoN skip-graphs [15, 16]       O(log^2 n)   Ō(log n / log log n)   Ō(log^2 n)               O(log^2 n / n)
family trees [21]              O(1)         Ō(log n)               Ō(log n)                 O(log n / n)
deterministic SkipNet [11]     O(log n)     O(log n)               O(log^2 n)               O(n^{0.32} / n)
bucket skip graphs [3]         O(log n)     Ō(log n)               Ō(log n)                 O(log^2 n / n)
skip-webs [2]                  O(log n)     Ō(log n / log log n)   Ō(log n / log log n)     O(log n / n)
rainbow skip graphs            O(1)         O(log n) w.h.p.        O(log n) amort. w.h.p.   O(log n / n)
strong rainbow skip graphs     O(1)         O(log n)               O(log n) amort.          O(n^ε / n)

Table 1: Comparison of rainbow skip graphs with related structures. We use Ō(*) to denote an
expected bound.

Thus, the family tree [21] is the only peer-to-peer structure we are familiar with that achieves 
efficient update and query times for ordered data while bounding M to be O(1) and maintaining
good congestion, which is O(log n / n). Unfortunately, Zatloukal and Harvey do not present any
fault-tolerance properties of the family tree, and it seems difficult to do so. 

1.2 Our Results 

In this paper, we present rainbow skip graphs, which are an adaptation of the skip graph of Aspnes
and Shah [4] designed to reduce the size of each node to be O(1) while nevertheless keeping the
congestion at O(log n / n) and providing for improved fault tolerance. Successor queries use O(log n)
messages with high probability, an improvement over the expected O(log n) messages of the family
tree. The update and congestion complexities of our structure are also optimal (amortized in the
update case), to within constant factors, under the restriction that nodes are of constant size. In
addition, we present a strong version of rainbow skip graphs, which achieve good worst-case bounds
for queries and updates (amortized in the update case), albeit at a slight degradation in congestion,
which is nevertheless not as severe as that of deterministic SkipNet [11]. In Table 1, we
highlight how our methods compare with previous related solutions.
Our improvements are based on the following: 

• Rainbow connections: collections of parallel links between related components in the data 
structure. These connections allow for a high degree of connectivity between related compo- 
nents without the need to use more than a constant amount of memory per node. 

• Hydra components: components of related nodes organized so that deleting even a constant 
fraction of the nodes in the component leaves a relatively large connected subcomponent. 

We use the rainbow connections with the hydra components and erasure codes so that we can 
fully recover from significant sets of node deletions, even to recover all the lost data. We present a 
periodic failure recovery mechanism that can, with high probability, restore the correct structure 
even if each node fails independently with constant probability less than one. If k nodes have failed, 
the repair mechanism uses O(min(n, k log n)) messages over O(log^2 n) rounds of message passing.
Figure 2: A skip graph. The dashed lines show the separations between the different levels.

2 Non-Redundant Rainbow Skip Graphs 

Skip graphs [4, 12] can be viewed as a distributed extension of skip lists [17]. Both skip lists and
skip graphs consist of a set of increasingly sparse doubly-linked lists ordered by levels starting at
level 0, where membership of a particular node x in a list at level i is determined by the first i bits
of an infinite sequence of random bits associated with x, referred to as the membership vector of x,
and denoted by m(x). We further denote the first i bits of m(x) by m(x)|i. In the case of skip lists,
level i has only one list, for each i, which contains all elements x s.t. m(x)|i = 1^i, i.e., all elements
whose first i coin flips all came up heads. As this leads to a bottleneck at the single node present in
the uppermost list, skip graphs have 2^i lists at level i, which we will index from 0 to 2^i - 1. Node x
belongs to the jth list of level i if and only if m(x)|i corresponds to the binary representation of j.
Hence, each node is present in one list of every level until it eventually becomes the only member
of a singleton list. (See Figure 2.)
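
The membership rule described above can be illustrated with a minimal Python sketch. The function names (`random_membership_vectors`, `level_lists`, `height`) and the use of finite bit tuples in place of infinite membership vectors are assumptions made purely for illustration; they are not part of the structure's definition.

```python
import random
from collections import defaultdict

def random_membership_vectors(keys, bits, seed=0):
    """Give each key a finite prefix of its (conceptually infinite) membership vector."""
    rng = random.Random(seed)
    return {k: tuple(rng.randint(0, 1) for _ in range(bits)) for k in keys}

def level_lists(vectors, level):
    """Group keys by the first `level` bits of their membership vectors.

    Returns a dict mapping each bit prefix to the sorted list of keys sharing it.
    """
    lists = defaultdict(list)
    for key in sorted(vectors):
        lists[vectors[key][:level]].append(key)
    return dict(lists)

def height(vectors, key):
    """First level at which `key` is alone in its list (None if the prefixes are too short)."""
    for level in range(len(vectors[key]) + 1):
        if level_lists(vectors, level)[vectors[key][:level]] == [key]:
            return level
    return None

if __name__ == "__main__":
    vectors = {10: (0, 0), 20: (0, 1), 30: (1, 0), 40: (0, 1)}
    print(level_lists(vectors, 1))  # keys 10, 20, 40 share prefix (0,); key 30 is alone
    print(height(vectors, 30))      # key 30 becomes a singleton at level 1
```

Each level partitions the same sorted key set, so the lists containing any one key form a skip list over that key's view of the data.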

It is useful to observe that the set of all lists to which a particular node x belongs meets the 
definition of a skip list, with membership in level i determined by comparison to m(x)|i rather than
to 1^i. With this observation, the algorithms and time analysis* for searching, insertion, and
well-behaved deletion in a skip graph all follow directly from the corresponding algorithms and analysis
of skip lists. Nodes in a skip graph have out-degree proportional to the height of their corresponding
skip list, which has been shown to be Θ(log n) with high probability; thus, the storage requirement
of each node includes Θ(log n) address pointers to other nodes.

In the following subsection, we present a scheme that results in a new overlay structure, which 
we call a non-redundant rainbow skip graph. This structure has the property that each node has 
constant out-degree, and hence need store only a constant number of pointers to other nodes, 
matching the best-known results of the family tree [21]. Moreover, as we show, the non-redundant 
rainbow skip graph has O(log n / n) congestion. In subsequent sections, we show how to augment
the non-redundant rainbow skip graph to support ill-mannered node deletions, or node failures,
in which a node leaves the network without first notifying its neighbors and providing the
necessary information to update the graph structure. More significantly, this scheme will, with high
probability, allow us to efficiently restore the proper structure even if all nodes simultaneously fail
independently with some constant probability less than one. In particular, we will be capable of
reconnecting components of the graph that are disconnected after the failures. We refer to this
augmented data structure as the rainbow skip graph.

*For the sake of simplicity, we assume that the nodes in our network are synchronized; we address in the full
version the concurrency issues that relaxing this assumption requires.



2.1 The Structure of Non-Redundant Rainbow Skip Graphs 

A non-redundant rainbow skip graph on n nodes consists of a skip graph on Θ(n / log n) supernodes,
where a supernode consists of Θ(log n) nodes that are maintained in a doubly-linked list that we will
refer to as the core list of the supernode. The nodes are partitioned into the supernodes according
to their keys so that each supernode represents a contiguous subsequence of the ordered sequence
of all keys. The smallest key of a supernode S will be referred to as the key of S, and we use these
keys to define the skip graph on the supernodes. For each supernode S, we associate a different
member of S with each level i of the skip graph, and call this member the level i representative
of S, which we denote as S_i. The level i list to which S belongs will contain S_i. Collectively we
refer to these lists of the skip graph as the level lists. S_i, which can be chosen arbitrarily from
among the elements of S, will be connected to S_{i+1} and S_{i-1}, which we respectively call the parent
and child of S_i. These vertical connections form another linked list associated with supernode S
that we refer to as the tower list of S. By maintaining the supernodes so that their size is greater
than their height in the skip graph, each member of a supernode will belong to at most three
lists: the core list, the tower list, and one level list. The issue of supernode size and height will
arise frequently, and we let S.size and S.height denote the size and height of S, respectively. The
implicit connections between the nodes in a core list and their copies in the related tower list form
one type of "rainbow" connections, which motivates the name of this structure. (See Figure 3.)



Figure 3: A non-redundant rainbow skip graph. The lists on each level are shown in dashed ovals. 
The core lists on the bottom level are shown as sets of contiguous squares. The rainbow connections 
between one core list and its related tower list are also shown. 

Searching in a Non-Redundant Rainbow Skip Graph. To search for a node with key k 
from node x, we find the top-level representative of the supernode of x and then perform a standard 
skip graph search for the predecessor of k in the set of supernode keys. Once the predecessor of k 
is found, we linearly scan through the corresponding supernode until a key with value k or more 
is encountered, and return the address of the corresponding node to node x. Each of these steps 
requires O(logn) time, given that we properly maintain the size of every supernode to be O(logn). 
We subsequently address this maintenance in the discussion of insertion and deletion operations. 
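
The two-step search described above (locate the predecessor supernode, then scan its core list) can be sketched as follows. This is a centralized illustration only: supernodes are modeled as sorted Python lists, and `bisect_right` is an assumed stand-in for the distributed skip graph search over supernode keys, which is not modeled here. The function name `successor` is hypothetical.

```python
from bisect import bisect_right

def successor(supernodes, k):
    """Return the smallest stored key >= k, or None if there is none.

    `supernodes` is a list of non-empty sorted key lists covering contiguous,
    increasing key ranges; the key of a supernode is its smallest key.
    """
    supernode_keys = [s[0] for s in supernodes]
    # Step 1 (stand-in for the skip graph search): predecessor of k among supernode keys.
    idx = bisect_right(supernode_keys, k) - 1
    if idx < 0:
        # k precedes every supernode key, so the overall minimum is the successor.
        return supernodes[0][0] if supernodes else None
    # Step 2: linear scan of the core list of the predecessor supernode.
    for key in supernodes[idx]:
        if key >= k:
            return key
    # Every key in this supernode is smaller than k: the successor, if any,
    # is the key of the next supernode.
    return supernodes[idx + 1][0] if idx + 1 < len(supernodes) else None
```

With supernodes of size O(log n), the scan in step 2 touches O(log n) nodes, matching the cost of the skip graph search in step 1.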

Updating a Non-Redundant Rainbow Skip Graph. The method of maintaining supernode 
sizes of O(log n) is essentially the standard merge/split method such as that used in maintaining
B-trees: we set constants c1 and c2 such that the size of a supernode is always between c1 log n
and c2 log n, merging two adjacent supernodes whenever one drops to a size less than c1 log n, and
splitting a supernode into two supernodes whenever it reaches a size of more than c2 log n. The
primary complication with this approach stems from the distributed setting, in which one cannot
efficiently maintain the exact value of n at every node; to do so would require a message to every
supernode upon every insertion. The common solution is to estimate the value of log n locally via
some random process. 

With some slight modifications to a proof of a theorem from [21], we can arrive at the following.

Theorem 1: In a skip graph on n nodes, the height of every node is Θ(log n) with high probability.

Proof: Consider first the probability that the height of some particular node x is k_1 log n or less.
This happens only if all other nodes differ from m(x)|k_1 log n in at least one membership vector
bit. The probability of this is

\[
\left[1 - \left(\tfrac{1}{2}\right)^{k_1 \log n}\right]^{n-1}
= \left(1 - \frac{1}{n^{k_1}}\right)^{n-1}
< 2\left(1 - \frac{1}{n^{k_1}}\right)^{n}
< 2e^{-n/n^{k_1}}
= 2e^{-n^{1-k_1}}.
\]

Thus the probability that there exists some node with height k_1 log n or less is at most 2n e^{-n^{1-k_1}}.
Consider now the probability that the height of some particular node x is k_2 log n or more. This
happens if one or more of the other nodes agree with all bits of m(x)|k_2 log n. This probability is
no more than (n - 1)/n^{k_2} < 1/n^{k_2 - 1}. It follows that the probability that there exists some node
with height of k_2 log n or more is at most 1/n^{k_2 - 2}. Suitably chosen values of k_1 and k_2 will yield a
suitably high probability. ■
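
As an informal illustration of Theorem 1 (not a substitute for the proof), the following Python sketch samples random membership vectors and measures node heights empirically. The function names, the 64-bit truncation of the membership vectors, and the fixed seed are assumptions made for the example.

```python
import math
import random

def common_prefix_len(a, b):
    """Number of leading positions on which bit tuples a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def node_heights(n, bits=64, seed=0):
    """Sample n random membership vectors and return each node's height,
    i.e. the first level at which the node is alone in its list."""
    rng = random.Random(seed)
    vectors = sorted(tuple(rng.getrandbits(1) for _ in range(bits)) for _ in range(n))
    heights = []
    for i, v in enumerate(vectors):
        # In sorted order, the longest shared prefix is with an adjacent vector.
        best = -1  # a lone node is already a singleton at level 0
        if i > 0:
            best = max(best, common_prefix_len(v, vectors[i - 1]))
        if i + 1 < n:
            best = max(best, common_prefix_len(v, vectors[i + 1]))
        heights.append(best + 1)
    return heights

if __name__ == "__main__":
    n = 1024
    h = node_heights(n)
    print("log2(n) =", math.log2(n), "min height =", min(h), "max height =", max(h))
```

Running the sketch for increasing n gives a quick sense of how tightly the minimum and maximum heights track log n.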

Ideally we would simply use S. height as the estimate for logn. However, the height of a node 
can potentially change dramatically when its neighbor at the highest level is deleted, or when a 
new neighbor is inserted. This creates the potential for a cascading series of supernode merges 
and splits due solely to changes in this local probabilistic estimate of logn, which complicates the 
otherwise-straightforward amortization argument. For simplicity, we deal with this complication 
by maintaining an estimate log n' that is common to every node, in the following manner: whenever 
some supernode has a height that is outside of the range [| log n', 6 log n'] , we recompute the current 
number of nodes in the structure, n'' (requiring O(n) messages), set n' = n'', and rebuild the entire
structure. With high probability, Theorem 1 guarantees that we rebuild only after the size of the
structure has increased from n' to (n')^2 or has decreased to nearly (n')^{1/2}. In either case, this
implies that, with high probability, Ω(n'') operations have been performed, which will suffice to
yield an amortized cost of O(log n).

We can now describe the insertion procedure. To insert a node x with key k we first search for 
the predecessor of k and insert x into the corresponding supernode S. If S.size exceeds 9 logn', 
then we split S into two equal-sized supernodes. The supernode containing the larger keys of S 
must then be inserted into the skip graph; it can be inserted by the standard insertion procedure of 
a skip graph, except that at each level, a different representative is inserted into the corresponding 
list. We omit the details of this operation as it is a relatively straightforward adaptation of the 
skip graph method. 

Similarly, if, upon deleting a node, S.size falls below 3 log n', then we merge S with one of its
neighbors, or simply transfer a fraction of the neighbor's nodes to S if the total number of nodes
between them exceeds 9 log n'. If S is merged with its neighbor, then the old supernode is deleted
from the skip graph.
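
The split, merge, and transfer rules above can be sketched in Python as follows. This is a centralized illustration under stated assumptions: supernodes are sorted key lists, `n_est` plays the role of n', logarithms are base 2, the neighbor is chosen as the right one when it exists, and a "transfer" shares the nodes evenly between the two supernodes. The text leaves these choices open, and skip graph maintenance is not modeled.

```python
import math

LOW, HIGH = 3, 9  # thresholds 3 log n' and 9 log n' from the text

def rebalance(supernodes, idx, n_est):
    """Restore the size invariant of supernodes[idx] after an insertion or deletion.

    `supernodes` is a list of sorted key lists in key order and `n_est` plays
    the role of n'. Returns the updated list of supernodes.
    """
    log_n = math.log2(n_est)
    s = supernodes[idx]
    if len(s) > HIGH * log_n:
        # Split into two (nearly) equal halves; the upper half becomes a new supernode.
        mid = len(s) // 2
        return supernodes[:idx] + [s[:mid], s[mid:]] + supernodes[idx + 1:]
    if len(s) < LOW * log_n and len(supernodes) > 1:
        # Pick a neighbor (the right one if it exists, otherwise the left one).
        j = idx + 1 if idx + 1 < len(supernodes) else idx - 1
        lo, hi = min(idx, j), max(idx, j)
        combined = supernodes[lo] + supernodes[hi]
        if len(combined) > HIGH * log_n:
            # Too many nodes to merge: share them evenly instead.
            mid = len(combined) // 2
            pair = [combined[:mid], combined[mid:]]
        else:
            pair = [combined]
        return supernodes[:lo] + pair + supernodes[hi + 1:]
    return supernodes
```

Because a freshly split or merged supernode sits well inside the allowed size range, many further updates are needed before it triggers another restructuring, which is the basis of the amortization.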
Load Balance in a Non-Redundant Rainbow Skip Graph. The following theorem bounds
the congestion of a non-redundant rainbow skip graph. 

Theorem 2: The congestion of an n-node non-redundant rainbow skip graph is O(log n / n).

Proof: There are three phases to the search: first, traversing the tower list to the top-level
representative; second, traversing the level lists to find the supernode with key nearest the destination;
third, traversing the core list of this supernode until finding the actual goal. Only an O(log n / n)
fraction of queries will pass through a particular node u with key k(u) in phases one and three
because only that fraction of keys belong to the same supernode as u. To analyze the fraction of
queries that pass through u in the second phase, we consider first the probability that a search with
a particular start and destination pair (s, t) passes through node u, and then bound the average
over all choices of (s, t). Let i be the level at which u is a representative in the skip graph. Note
that a query from s to t will pass through node u only if the following two conditions hold:

1. m(s)|i = m(u)|i

2. ∄ v s.t. k(u) < k(v) < k(t) and m(v)|(i+1) = m(s)|(i+1);

that is, there is no node between u and t that is present in level i + 1 of the start's skip list.

These events are independent and thus the probability that they hold is exactly
(1/2^i)(1 - 1/2^{i+1})^d, where d denotes the number of supernodes whose keys are between k(u) and k(t).
Note that the particular choice of s plays no role in this probability. We thus average only over
the choices of t. Noting further that for each distance d there are only O(log n) nodes whose
supernodes fall at a distance of exactly d, we change the summation to be over d, yielding the
following bound on the congestion of u:

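\[
\begin{aligned}
\mathrm{congestion}(u) &\le \frac{c \log n}{n} \sum_{d \ge 0} \frac{1}{2^i}\left(1 - \frac{1}{2^{i+1}}\right)^{d} \\
&\le \frac{c \log n}{n} \cdot \frac{1}{2^i} \cdot \frac{1}{1 - \left(1 - \frac{1}{2^{i+1}}\right)} \quad \text{(geometric series)} \\
&= \frac{2c \log n}{n}.
\end{aligned}
\]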
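
The geometric-series step in the congestion bound can be checked numerically. The short Python sketch below (function names are hypothetical) evaluates the sum over d of (1/2^i)(1 - 1/2^{i+1})^d and compares it with its closed form, which equals 2 for every level i; this is what makes the final bound independent of the level at which u is a representative.

```python
def phase_two_sum(i, terms):
    """Partial sum over d = 0..terms-1 of (1/2**i) * (1 - 1/2**(i+1))**d."""
    ratio = 1 - 1 / 2 ** (i + 1)
    return sum((1 / 2 ** i) * ratio ** d for d in range(terms))

def closed_form(i):
    """Value of the infinite series: (1/2**i) / (1 - ratio), which is 2 for every i."""
    ratio = 1 - 1 / 2 ** (i + 1)
    return (1 / 2 ** i) / (1 - ratio)

if __name__ == "__main__":
    for level in range(6):
        print(level, phase_two_sum(level, 3000), closed_form(level))
```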



In the section that follows we introduce a novel substructure that we term hydra components. 
Members of a hydra component will remain connected with high probability even if a constant 
fraction of the members simultaneously fail. By employing erasure-resilient codes, two hydra
components can be "linked" so that the non-failing members of each component can remain connected
to each other. We will use this to partition the lists of the non-redundant skip graph into hydra
components and link them via "rainbow connections" in such a way that all non-failing
representatives of a given level will remain connected to each other, and such that the representatives of
level i will remain connected to representatives of level i + 1. Collectively this will ensure total
connectivity and provide a simple and efficient means of locally correcting the link structure of the 
rainbow skip graph. 
3 Hydra Components



We now describe hydra components: collections of nodes organized in such a way that if each
member fails independently with constant probability p, then with high probability the nodes
that remain can collectively compute the critical network-structure information of all the nodes in
that component, including those that have failed. This "critical information" should consist of
whatever information is necessary in order to remove local failed nodes from the overlay network
and recompute correct links for the nodes that remain. Although in principle these failure-resilient
blocks can be designed to handle any constant failure probability less than 1, for simplicity we
define them to handle a failure probability of 1/2.

An (n, c, ℓ, r)-erasure-resilient code consists of an encoding algorithm and a decoding algorithm.
The encoding algorithm takes a message of n bits and converts it into a sequence of ℓ-bit packets
whose total size is cn bits. The decoding algorithm is able to recover the original message from
any set of packets whose total length is rn. Alon and Luby [1] provide a deterministic
(n, c, ℓ, r)-erasure-resilient code with linear-time encoding and decoding algorithms with ℓ = O(1). Although
these codes are not generally the most practical, they give the most desirable theoretical results.
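
To make the erasure-code interface concrete, here is a toy Python sketch. It is not the Alon and Luby construction: it substitutes a simple Reed-Solomon-style polynomial code over the prime field GF(257), chosen only because it is short and exhibits the same "any sufficiently large set of packets decodes" property. The names `encode`, `decode`, and `lagrange_eval` are hypothetical, and the quadratic running time is far from the linear time cited above.

```python
P = 257  # a prime larger than any byte value

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through `points`, mod P."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yj * num * pow(den, -1, P)) % P
    return total

def encode(message, n_shares):
    """Encode k byte values into n_shares >= k packets (x, y); any k of them suffice."""
    base = list(enumerate(message))  # the message itself sits at x = 0..k-1
    return [(x, lagrange_eval(base, x)) for x in range(n_shares)]

def decode(packets, k):
    """Recover the k original byte values from any k distinct packets."""
    chosen = packets[:k]
    return [lagrange_eval(chosen, x) for x in range(k)]

if __name__ == "__main__":
    msg = [72, 101, 108, 108, 111]
    shares = encode(msg, 10)           # expansion factor c = 2
    print(decode(shares[5:], 5))       # recover from the five non-systematic packets
```

In the notation of the text, this toy code expands the message by a factor c = n_shares / k and recovers it from any k packets.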

To build the hydra components, we make use of a 2d-regular graph structure consisting of
the union of d Hamiltonian cycles. The set of all such graphs on n vertices is denoted by H_{n,d}.
A (μ, d, δ)-hydra-component consists of a sequence of μ nodes logically connected according to a
random element of H_{μ,d}, with each node storing an equal share of a message encoded by a
suitably chosen (n, c, ℓ, r)-erasure-resilient code. The parameters of the erasure-resilient code should be
chosen in such a way that the entire message can be reconstructed from the collective shares of
information stored by any set of δμ nodes, i.e., such that r/c = δ. The message that is encoded will
be the critical information of the component. Clearly, if the critical information consists of M bits,
then by evenly distributing the packets of the encoded message across all the nodes, O(M/μ) space
is used per node, given that c is a constant. In addition to this space, O(μ) space will be needed
to store structures associated with the erasure-resilient code. However, in our applications μ will
be no larger than the space needed for O(1) pointers, i.e., no more than O(log n).

To achieve a high-probability bound on recovery, we rely upon the fact that random elements
of H_{μ,d} are likely to be good expanders. In particular, we make use of a theorem from [8], which is
an adaptation of a theorem from [Tj [6] . We state the theorem below, without proof. 

Theorem 3: Let V be a set of μ vertices, and let 0 < γ, λ < 1. Let G be a member of H_{μ,d}
defined by the union of d independent randomly-chosen Hamiltonian cycles on V. Then, for all
subsets W of V with λμ vertices, G induces at least one connected component on W of size greater
than γλμ with probability at least

\[
1 - e^{\mu\left[(1+\lambda)\ln 2 + d\left(\alpha \ln \alpha + \beta \ln \beta - (1-\lambda)\ln(1-\lambda)\right)\right] + O(1)},
\]

where α = 1 - ((1 - γ)/2)λ and β = 1 - ((1 + γ)/2)λ.

With suitably chosen μ, λ, and γ, Theorem 3 will guarantee that with high probability at
least γλμ nodes are connected, conditioned on the event that λμ nodes of the component have not
failed. By applying Chernoff bounds, it can be shown that this event occurs with high probability
for suitably chosen μ and λ. These facts directly yield the following lemma and theorem.
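
The behavior that Theorem 3 describes can be explored empirically with a small simulation. The Python sketch below builds the union of d random Hamiltonian cycles, fails each vertex independently, and reports the largest connected component among the survivors. All function names, the default parameter values, and the seed are assumptions chosen for illustration; the simulation is not part of the analysis.

```python
import random

def random_hamiltonian_union(mu, d, rng):
    """Adjacency sets of the union of d random Hamiltonian cycles on vertices 0..mu-1."""
    adj = {v: set() for v in range(mu)}
    for _ in range(d):
        order = list(range(mu))
        rng.shuffle(order)
        # Connect consecutive vertices of the shuffled order, wrapping around.
        for a, b in zip(order, order[1:] + order[:1]):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def largest_surviving_component(adj, alive):
    """Size of the largest connected component induced on the vertex set `alive`."""
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w in alive and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best

def simulate(mu=400, d=47, p_fail=0.5, seed=0):
    """Return (number of surviving vertices, size of their largest component)."""
    rng = random.Random(seed)
    adj = random_hamiltonian_union(mu, d, rng)
    alive = {v for v in range(mu) if rng.random() >= p_fail}
    return len(alive), largest_surviving_component(adj, alive)

if __name__ == "__main__":
    print(simulate())
```

Varying d in `simulate` shows how the number of cycles affects the connectivity of the surviving vertices.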
Lemma 1: If each node of a component of β log n nodes fails independently with probability 1/2,
then the number of non-failing nodes is no less than λμ with probability at least

\[
1 - \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\frac{\beta \log n}{2}},
\]

where δ = 1 - 2λ.

Proof: Let X be the number of failed nodes. With μ = β log n and the failure
probability 1/2, E(X) = (β log n)/2. Since each node fails independently, a Chernoff bound can be
applied to the probability that more than (1 - λ)μ nodes would fail:

\[
\Pr[X > (1-\lambda)\mu] = \Pr\left[X > (1+\delta)\frac{\beta \log n}{2}\right]
< \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\frac{\beta \log n}{2}}.
\]

Theorem 4: For any constant k, there exist constants d, β, and δ such that with probability
1 - O(1/n^k) the critical information of a (β log n, d, δ) hydra component can be recovered in O(log n)
time when each node of the component fails with probability 1/2.

Proof: Setting β to 40 and λ to 3/10 provides a lower bound of 1 - O(1/n^2) on the probability that
at least 3/10 of the nodes do not fail. Theorem 3 can now be applied, with μ = β log n, γ = 1/2, and
d = 47, to guarantee that there is a connected component amongst the 3/10-fraction of unfailed nodes
of size (3/20)μ with probability 1 - O(1/n^2). Thus the probability that the conditions of Theorem 3 and
Lemma 1 both hold is 1 - O(1/n^2) for the given values of the parameters. This can be extended to
any other value of k, for example by changing β, which appears in the exponent of both probability
terms, by a factor of k/2.

By employing an intelligent flooding mechanism, the packets held by the (3β log n)/20 connected
nodes can be collected together in O(log n) messages. The erasure-resilient code can then be used
to reconstruct the critical information by choosing parameters of the code so that r/c = δ = 3/20. ■
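
The constants in the proof can be sanity-checked numerically. The Python sketch below evaluates the exponents of the two failure bounds as powers of n. It assumes β = 40 and d = 47 as stated in the proof, takes λ = 3/10 and γ = 1/2 as assumed parameter values, uses base-2 logarithms, and ignores the O(1) term in the exponent of the expander bound; the function names are hypothetical.

```python
import math

def chernoff_exponent(beta, lam):
    """Exponent e such that the Chernoff failure bound equals n**(-e) (logs base 2)."""
    delta = 1 - 2 * lam
    # Natural log of e^delta / (1 + delta)^(1 + delta).
    per_unit = delta - (1 + delta) * math.log(1 + delta)
    # bound = exp(per_unit * beta * log2(n) / 2) = n ** (per_unit * beta / (2 ln 2))
    return -per_unit * beta / (2 * math.log(2))

def expander_exponent(beta, lam, gamma, d):
    """Exponent e such that the expander failure bound is n**(-e), up to the O(1) term."""
    a = 1 - (1 - gamma) / 2 * lam
    b = 1 - (1 + gamma) / 2 * lam
    per_vertex = (1 + lam) * math.log(2) + d * (
        a * math.log(a) + b * math.log(b) - (1 - lam) * math.log(1 - lam))
    # mu = beta * log2(n), so exp(per_vertex * mu) = n ** (per_vertex * beta / ln 2)
    return -per_vertex * beta / math.log(2)

if __name__ == "__main__":
    print(chernoff_exponent(40, 0.3))
    print(expander_exponent(40, 0.3, 0.5, 47))
    print(expander_exponent(40, 0.3, 0.5, 46))
```

Under these assumptions both exponents exceed 2 when d = 47, while d = 46 falls short for the expander bound, and both exponents scale linearly with β.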

For the remainder of the paper, when we use the term "hydra component" we will implicitly
mean the (Θ(log n), Θ(1), Θ(1)) hydra components described in the proof of Theorem 4 unless
otherwise stated.



4 Rainbow Skip Graphs 

Having defined a hydra component, what remains is to describe how the nodes of a non-redundant 
skip graph are partitioned into hydra components to yield the complete rainbow skip graph. 

Let 9β log(n') be the minimum size of each hydra component, which will be at least β log n
with high probability. The maximum size of a hydra component will be maintained as 27β log(n').
The elements of the level lists are partitioned into hydra components with respect to their order
in the lists: elements of a contiguous sublist will be placed together in a hydra component. We
call such hydras the level-list hydras. Naturally, sometimes a list, or the remaining end of a list,
will be too small to fill an entire hydra component. In such cases, the partition will span into the
beginning of the next list of the same level. That is, if a hydra component containing the end of
the jth list of level i is smaller than 9β log(n'), then the hydra component will also contain the
beginning of the (j + 1)-th list of level i, or the 0th list of level i - 1 if j = 2^i - 1. The core lists
and tower lists will be partitioned in a similar manner, but we group the core list and tower list of
each supernode together as a pair, since they are inherently tied together through the supernodes,
one being a subset of the other. We call these hydras the supernode hydras. Again when the hydra
is too small, supernodes adjacent with respect to their keys are grouped together.
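
The spill-over rule for forming level-list hydras can be sketched as a simple greedy partition. In the Python sketch below, the function name `partition_into_hydras`, the greedy cutting policy, and the handling of leftovers (appending them to the last group) are assumptions made for illustration; the text fixes only the size bounds and the fact that a group may continue into the next list.

```python
def partition_into_hydras(lists, min_size):
    """Cut the concatenation of `lists` (taken in order) into contiguous groups.

    Each group receives at least `min_size` elements, continuing into the next
    list when the remainder of the current list is too small. Any leftover
    elements at the very end are appended to the last group.
    """
    groups, current = [], []
    for lst in lists:
        for item in lst:
            current.append(item)
            if len(current) >= min_size:
                groups.append(current)
                current = []
    if current:
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups

if __name__ == "__main__":
    level_lists = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10, 11]]
    print(partition_into_hydras(level_lists, 3))
```

With this policy every group has at least `min_size` and fewer than twice `min_size` elements whenever the total is at least `min_size`, which stays within the factor-of-three window between the minimum and maximum hydra sizes given above.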



Figure 4: Rainbow connections between hydra components. 

The primary critical information that will be associated with every hydra component is an 
ordered list of the addresses of every node in the component. This information will ensure that 
the unfailed nodes of a component can restore connectivity to each other locally. In the case of 
the level-list hydras, the critical information will also include the addresses of the parents and 
children of every element. Additionally, every pair of adjacent level-list hydras L and R will be 
linked together by "rainbow connections", which amount to storing the addresses of all elements of 
R in the critical information of L and vice versa. (See Figure 4.) In total the critical information 
consists of O(log n) pointers, which when distributed evenly as encoded packets to the O(log n) 
members requires space corresponding to O(1) pointers. 

Hydra components are maintained in the same way as supernodes, with the same merge/split 
mechanism. By design, the cost of a split or merge amortizes to a constant amount of overhead. 
Whenever a node is added or removed from a hydra component, we must recompute and redistribute 
the encoded critical information of that hydra and all (of the constant number of) hydras to which 
it is linked, requiring time proportional to the size of the hydra. Noting that every node in the 
rainbow skip graph belongs to a constant number of hydra components, we arrive at the following. 

Theorem 5: The amortized message cost of insertion and well-behaved deletion, U(n), in a 
hydra-augmented rainbow skip graph is O(log n) with high probability. 

Proof: The addition of hydras will not affect the global rebuilding costs, so we look only at 
the cost associated with normal insertions into a supernode S. When S does not split or merge 
during insertion, only the supernode hydra associated with S changes, which requires O(log n) 
time. Deletion may also entail a change at a single level-list hydra when a representative of S is 
being deleted, again incurring only O(log n) cost. When S does split, we must insert a supernode 
into the skip graph and potentially must change the representatives of S so that they all come 
from the lower half of the range of keys in S. This will in total change O(log n) hydras, requiring 
O(log^2 n) time. Similar reasoning applies for merging. However, a supernode splits or merges only 
after Ω(log n) insertions and deletions, and thus the amortized time is O(log n). ■ 

We now describe the procedure to restore the structure after some number of nodes have failed. 
For simplicity we assume that no additional failures occur during the repair process; otherwise, it 
would be necessary to extend the model to reflect the rate at which nodes are failing with respect 
to time. 

We initially assume that no supernode drops below the minimum size constraints. The goal 
will be to replace each failed representative with a new node from the same supernode. In parallel, 
each hydra H in which at least one node has failed recovers its critical information. The core-list 
information of each supernode S is first used to determine new representatives at each level i with 
a failed representative. This new S_i is then linked to S_{i-1} and S_{i+1}. 

With the parent/child information now corrected, what remains is to repair the sibling pointers 
of each level list. As a basis, we repair level 0 by using the rainbow connections to identify some 
unfailed member of each supernode that neighbors a failed representative S_0. These nodes are 
sent replacement messages that indicate the new S_0, which is then connected to the appropriate 
neighbors. 

The other lists are then repaired sequentially by order of increasing level. Suppose that the lists 
of level i are being repaired, and hence that the lists at levels 0 through i - 1 have already been 
repaired. Let S be some supernode which contains a failed representative at level i. To restore 
the proper structure, we wish to pass an insertion message to the neighbors of S at level i that 
contains the address of the new S_i. We use S_{i-1} to enter the list at level i - 1 and pass replacement 
messages to its left and right neighbors, which forward the messages until nodes belonging to the 
same level i list are reached. With high probability the distance to these nodes is O(log n). 
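The forwarding step above can be illustrated with a small sketch. It assumes, purely for illustration, that an already repaired level i - 1 list is given as a linear Python list in which entry p holds the membership bit that decides which level i list the p-th node joins; the function name `level_i_neighbors` is hypothetical.

```python
from typing import List, Optional, Tuple


def level_i_neighbors(bits: List[int], pos: int) -> Tuple[Optional[int], Optional[int]]:
    """Walk left and right from position `pos` in a level i-1 list until a
    node with the same next membership bit is found on each side.

    Returns the positions of the left and right neighbors in the level i
    list, or None on a side where no such node exists.
    """
    target = bits[pos]
    # Forward the message leftward, one hop at a time.
    left = next((p for p in range(pos - 1, -1, -1) if bits[p] == target), None)
    # Forward the message rightward, one hop at a time.
    right = next((p for p in range(pos + 1, len(bits)) if bits[p] == target), None)
    return left, right
```

The number of hops taken on each side corresponds to the distance that the text bounds by O(log n) with high probability.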

Once all levels are repaired, the hydra codes are recomputed, completing the repair process. 
Careful accounting of all actions described above yields the following theorem. 

Theorem 6: The failure recovery procedure restores the correct structure of the rainbow skip 
graph with high probability using O(log^2 n) rounds of message passing and O(min(n, k log n)) 
messages, where k is the number of nodes that failed. 

5 Strong Rainbow Skip Graphs 

The rainbow skip graph achieves O(log n) query time w.h.p. Although the random construction 
of the hydra components is critical for the rainbow skip graph to be failure-resilient, we are still 
able to de-randomize other parts of the structure to get an efficient worst-case search time. The 
resulting data structure, which we call the strong rainbow skip graph, will function as a deterministic 
peer of the family tree [21] (which is non-trivial to de-randomize), and will additionally provide 
powerful failure-resilience. The idea is to integrate the randomly constructed hydra components 
into a deterministic non-redundant rainbow skip graph in the same way they are integrated into a 
randomized non-redundant rainbow skip graph. This can be done once we have a deterministic 
skip graph at hand, noting that the method of partitioning and constructing hydra components 
is independent of the underlying skip graph. Doing this will, in addition to guaranteeing a worst-case 
search time, yield a tighter size for the supernodes and more freedom to rebuild than that of 
the ordinary rainbow skip graph. We discuss the size of the supernodes and that of the underlying 
deterministic skip graph next, then provide a novel deterministic skip graph in Section 5.1 
and present how to update this structure in Sections 5.2 and 5.3. 

Size of Supernodes. Let n be the number of keys when the current skip graph was built and 
l = log n. We maintain a supernode size in the range [l, 3l - 2]. If a deletion reduces a supernode S to 
size l - 1, then S either borrows a key from the next block S' if there are at least l + 1 keys in S', 
or merges with S' into a new supernode of size 2l - 1. Similarly, if an insertion grows S to size 
3l - 1, S either gives the additional key to S' if S' contains at most 3l - 3 keys, or merges with S' 
and then splits into three supernodes of size 2l - 1. Thus any newly generated supernode is of size 
2l - 1, so it will tolerate at least l insertions or deletions before the next merge or split. 
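The borrow, merge and split rules above can be checked with a short sketch that tracks only supernode sizes. The function name `rebalance` and the list-of-sizes representation are assumptions made for illustration, and the sketch assumes the next block S' exists at index `idx + 1`.

```python
from typing import List


def rebalance(sizes: List[int], idx: int, l: int) -> List[int]:
    """Restore the size range [l, 3l - 2] for supernode `idx` after a single
    insertion or deletion, using supernode `idx + 1` as the next block S'."""
    s, nxt = sizes[idx], sizes[idx + 1]
    if s == l - 1:                  # underflow caused by a deletion
        if nxt >= l + 1:            # S' can spare a key: borrow it
            new = [s + 1, nxt - 1]
        else:                       # S' has exactly l keys: merge into 2l - 1
            new = [s + nxt]
    elif s == 3 * l - 1:            # overflow caused by an insertion
        if nxt <= 3 * l - 3:        # S' has room: hand over one key
            new = [s - 1, nxt + 1]
        else:                       # S' has 3l - 2 keys: merge, split in three
            total = s + nxt         # equals 6l - 3 = 3 * (2l - 1)
            new = [total // 3] * 3
    else:                           # size already within range
        new = [s, nxt]
    return sizes[:idx] + new + sizes[idx + 2:]
```

For l = 4, a merge yields one supernode of size 7 and a merge followed by a split yields three supernodes of size 7, matching the 2l - 1 figure in the text.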

Size of The Skip Graph. A newly rebuilt skip graph has n keys and n/log n supernodes. We 
can rebuild it as soon as it grows to n' = n or shrinks to n' = n/(2 log n) supernodes. (As in the 
randomized case, the number of supernodes can be estimated by the height of each tower list so 
there is no need of any global information.) This provides that before a rebuild there have been at 
least (n' - n'/log n) supernode insertions or n' supernode deletions, so that the time for rebuild, 
which is O(n' log n), upon being amortized, requires only constant credits from each key insertion 
and deletion. On the other hand, in order to keep the O(logn) search time, the current skip graph 
can accommodate as many as n" = or as few as n" = n^^^ supernodes. Therefore we can rebuild 
at any point between n' and n" without affecting either the amortized update time or the worst-case 
search time. 

As summarized in the following theorem, we are left with the task of building a deterministic 
skip graph as the underlying structure of the strong rainbow skip graph. 

Theorem 7: If there is a deterministic skip graph with worst-case query time Q, worst-case update 
time U, and congestion C, then there is also a strong rainbow skip graph with worst-case query 
time Q, amortized update time U, congestion C, and resilience to node failures with constant 
probability. 

5.1 A Deterministic Skip Graph with Near Optimal Congestion 

Harvey and Munro [11] gave a deterministic SkipNet with O(log n) search time and O(log^2 n) 
update time, which could be used in the construction in Theorem 7. However, each node in [11] 
has three parents and two to five children, so that the congestion at a top-level node could be as 
bad as $(1/3)^{\log_5 n} = n^{-\log_5 3} = n^{-0.68}$. Here we provide another deterministic skip graph with the 
same search and update time but only $(1/2)^{\log_{2(k+1)/k} n} = n^{-1/(1+\log((k+1)/k))}$ congestion, with k 
being a free parameter, where each node has two parents and only (2 + 1/k) children on average. 
Compared with the optimal congestion O(log n/n) that is only achieved by randomized structures, 
our deterministic skip graph has O(n^ε/n) congestion, making it the first deterministic p2p structure 
to provide such near-optimal load balance. 
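To get a feel for the two congestion exponents compared above, the following sketch evaluates them numerically. The function names are hypothetical, and logarithms are taken base 2, which is an assumption consistent with the base-2 branching of the structure.

```python
import math


def three_parent_exponent() -> float:
    """Exponent e in the n^(-e) congestion of the three-parent structure,
    i.e. log_5 3 from (1/3)^(log_5 n) = n^(-log_5 3)."""
    return math.log(3, 5)


def two_parent_exponent(k: int) -> float:
    """Exponent e in the n^(-e) congestion of the two-parent structure with
    free parameter k, i.e. 1 / (1 + log2((k + 1) / k))."""
    return 1.0 / (1.0 + math.log2((k + 1) / k))


if __name__ == "__main__":
    print("three-parent exponent:", round(three_parent_exponent(), 2))
    for k in (1, 2, 5, 10, 100):
        e = two_parent_exponent(k)
        # The gap 1 - e plays the role of epsilon in the n^epsilon / n bound.
        print(f"k = {k:3d}: exponent {e:.4f}, epsilon {1 - e:.4f}")
```

The exponent approaches 1 as k grows, so the gap to the ideal 1/n congestion shrinks with larger k.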

Macro Structure. As in a randomized skip graph or SkipNet, the macro structure of our new 
deterministic skip graph is still an upside-down tree consisting of sublists, where there are 2^i sublists 
at the i-th level. Each sublist has a father (0) list and a mother (1) list in the level above it. Each 
key x still has a membership string whose first i bits determine the sublist at level i to which the 
copy of x at this level belongs. However, some bits of the string might be blurred, by which we mean 
that x is present but not counted at these levels when we balance the skip graph. Furthermore, 
some bits might be undetermined, meaning that x is not yet inserted into these levels and they 
hence have no value. Naturally, if bit i is undetermined, then any bit j > i is also undetermined. 
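A membership string with undetermined bits can be modeled as in the sketch below. The representation is an assumption chosen for illustration: bits are 0 (father) or 1 (mother), `None` marks an undetermined bit, and the blurred flag is omitted for brevity. The function names are hypothetical.

```python
from typing import List, Optional


def is_valid(membership: List[Optional[int]]) -> bool:
    """Check that once a bit is undetermined, every later bit is too."""
    seen_undetermined = False
    for bit in membership:
        if bit is None:
            seen_undetermined = True
        elif seen_undetermined:
            return False
    return True


def sublist_at_level(membership: List[Optional[int]], i: int) -> Optional[str]:
    """Return the i-bit prefix naming the sublist that holds the key's copy
    at level i, or None if the key is not yet inserted at that level."""
    prefix = membership[:i]
    if len(prefix) < i or any(bit is None for bit in prefix):
        return None
    return "".join(str(bit) for bit in prefix)
```

At level 0 the prefix is empty, which corresponds to the single bottom list that contains every key.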

The balance property of this skip graph is then that, in any sublist at any level, consecutive 
non-blurred nodes promote to the father list and the mother list alternately, and the blurred 
nodes are spread more than k steps apart. We may think of these normal nodes as promoting to 
the next level in pairs, where in each pair of sibling nodes one goes to the father and the other goes 
to the mother. We call the two siblings in such a pair the duals of each other. As defined, a blurred 
node may promote to either the father list or the mother list, or neither of them (when the next 
digit is undetermined). That is, the deterministic skip graph deviates from a perfect structure, 
where consecutive nodes in each list promote to the father list and the mother list alternately, by 
tolerating some sparsely spread blurred nodes. 

Theorem 8: (Height is near perfect.) The height of any tower in the skip graph is between 

$\frac{\log n}{1+\log((k+1)/k)}$ and $\frac{\log n}{1+\log((k+1)/(k+2))}$. 

Proof: If there are m normal nodes and up to m/k (the maximum possible) blurred nodes at level 
i, then there are at most (m/2 + m/k) and at least m/2 nodes in either of its parent lists at 
level i + 1. So the shrinking ratio from the size of a child list to that of a parent list is between 
(1 + 1/k)/(1/2 + 1/k) = 2(k + 1)/(k + 2) and (1 + 1/k)/(1/2) = 2(k + 1)/k, so that the height of 
any skip list inside this skip graph is between $\log_{2(k+1)/k} n = \frac{\log n}{1+\log((k+1)/k)}$ and 
$\log_{2(k+1)/(k+2)} n = \frac{\log n}{1+\log((k+1)/(k+2))}$. ■ 
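The two height bounds can be evaluated numerically as a sanity check. The sketch below assumes base-2 logarithms and uses the hypothetical function name `height_bounds`.

```python
import math
from typing import Tuple


def height_bounds(n: int, k: int) -> Tuple[float, float]:
    """Return the lower and upper tower-height bounds of Theorem 8.

    The lower bound uses the largest shrinking ratio 2(k + 1)/k and the
    upper bound uses the smallest shrinking ratio 2(k + 1)/(k + 2).
    """
    log_n = math.log2(n)
    lower = log_n / (1 + math.log2((k + 1) / k))
    upper = log_n / (1 + math.log2((k + 1) / (k + 2)))
    return lower, upper


if __name__ == "__main__":
    for k in (2, 10, 100):
        lo, hi = height_bounds(1024, k)
        print(f"k = {k:3d}: height between {lo:.2f} and {hi:.2f}")
```

Both bounds move toward log n as k grows, which is the sense in which the height is near perfect.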



Theorem 9: (Congestion is near optimal.) The maximum congestion at any tower, i.e., the 
congestion of the skip graph, is at most O(log n/n^{1-ε}) = O(n^ε/n), with ε depending only on k. 

Proof: We calculate the congestion at each node and then sum it for each tower. Let S and T be 
the start and destination of a randomly chosen search. In a perfect skip graph with two parents and 
two children for each node, the congestion at a node x_i at level i is the probability of S being inside 
the left or right subtree of x_i (which is 2^i/n) times the probability of T being outside that subtree 
(relaxed to 1 and ignored) and then times the probability of choosing the right search tree involving 
x_i (which is 1/2^i), so it is 1/n. In our skip graph, the probability of S being inside a subtree of 
x_i is at most (2(k + 1)/k)^i/n, so the congestion is $O((1/2)^{\log_{2(k+1)/k} n}) = O(n^{-1/(1+\log((k+1)/k))})$, 
repeating the above calculation. ■ 

Next we show how to update this skip graph to maintain the balance property. We first provide 
update operations with amortized O(log^2 n) time, then improve the running time to amortized 
O(log n) by grouping nodes into supernodes in a different way. 

5.2 Update Operations in The Deterministic Skip Graph 

As described above, the deterministic skip graph is an approximation to a perfect skip graph, 
with a balance constraint that any two blurred nodes in the same list are separated by at least k 
steps. When this is violated during updates and blurred nodes become too dense, we follow the 
procedure below to restore the balance property. A straightforward case is that, if there are two 
adjacent neighbors in a list that are both blurred, and they promote to the father and mother lists 
alternately and consistently with the alternation of other nodes in the same list, then we can simply 
unmark them as blurred nodes (unblur) and consider them two normal nodes. Then, if two blurred 
nodes are within k steps, we can do local surgeries to the structure to yield a similar result as in the 
straightforward case. Details are as follows. 

Swap. To swap two neighbor nodes at level i means to exchange their membership strings from 
digit i + 1 upward (as long as the digits are determined, including the blurred digits). This 
changes neither the structure of the skip graph nor the valid search property, since the two being 
neighbors at level i means that one can replace the other at any level above i. 

Unblur or Promote in Pairs. If, in any sublist, there are two blurred nodes within k steps 
of each other, and they promote to different parents, then we can unblur them both and fix the balance 
property, if it is violated, by O(k) swaps. This works regardless of whether there is an odd or even 
number of normal nodes between the two blurred nodes. For example, if we use f and m to 
indicate that a node promotes to the father list or mother list, and f(b) (or m(b)) means it is blurred 
and promotes to the father (or mother) list, then a sublist [x1 = f, x2 = f(b), x3 = m, x4 = f, x5 = 
m, x6 = m(b), x7 = f] will be fixed by swapping (x2, x3), (x4, x5) after unblurring x2 and x6, and 
a sublist [x1 = f, x2 = f(b), x3 = m, x4 = f, x5 = m, x6 = f, x7 = m(b), x8 = m] will be fixed 
by swapping (x2, x3), (x4, x5) and (x6, x7) after unblurring x2 and x7. A blurred node with the 
next digit undetermined can be promoted together with another blurred node regardless of which 
parent the latter promotes to. If there are two blurred nodes going to the same parent, then we 
cannot unblur them together no matter how close they are. However, it will turn out that a node is 
blurred only during deletion, and that we can control the deletion so that if a node is going to 
be blurred and there exists another blurred node nearby, we can blur either the node or its dual so 
that the newly blurred node can always be coupled with the existing blurred one. (See the deletion 
below.) 
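The two example sublists above can be replayed mechanically. In the sketch below, a sublist is modeled as a list of promotion labels ("f", "m", "f(b)", "m(b)"), and a swap of two neighbors is modeled as exchanging their labels, which is a simplification of exchanging the membership strings from the next digit upward. All function names are hypothetical.

```python
from typing import List, Tuple


def unblur(sublist: List[str]) -> List[str]:
    """Drop the blurred marker from every node, keeping its parent choice."""
    return [label.replace("(b)", "") for label in sublist]


def swap(sublist: List[str], a: int, b: int) -> List[str]:
    """Exchange the promotion labels of neighbors x_a and x_b (1-based)."""
    assert b == a + 1, "only adjacent nodes may be swapped"
    out = list(sublist)
    out[a - 1], out[b - 1] = out[b - 1], out[a - 1]
    return out


def alternates(sublist: List[str]) -> bool:
    """True if consecutive nodes promote to father and mother in turn."""
    return all(x != y for x, y in zip(sublist, sublist[1:]))


def fix(sublist: List[str], swaps: List[Tuple[int, int]]) -> List[str]:
    """Unblur the sublist, then apply the given neighbor swaps in order."""
    out = unblur(sublist)
    for a, b in swaps:
        out = swap(out, a, b)
    return out


if __name__ == "__main__":
    first = ["f", "f(b)", "m", "f", "m", "m(b)", "f"]
    second = ["f", "f(b)", "m", "f", "m", "f", "m(b)", "m"]
    print(fix(first, [(2, 3), (4, 5)]))
    print(fix(second, [(2, 3), (4, 5), (6, 7)]))
```

In both examples the swaps are disjoint neighbor pairs lying between the two unblurred nodes, so their number is bounded by the distance between those nodes.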

The unblurring of a blurred node with its next digit determined doesn't cause any change to 
the upper level. The promotion of a blurred node with the next digit undetermined, which unblurs 
this node and then inserts a blurred node (determines a bit) at the upper level, will propagate if 
the newly inserted node is within k steps to another blurred node in the same sublist. 

Insertion. To insert a key we first insert into the bottom list a blurred node and assign to it an 
empty membership string with all digits undetermined. We then propagate the promotions until 
the balance property is restored. 

Deletion. We delete all copies of the key x from the bottom level up to the top level. At each 
level i, in the first case, if there is no blurred node within k steps of x_i or its dual, then delete 
x_i and blur its dual. In the second case, if there is a blurred node y_i within k steps that promotes 
to the same parent as x_i, then delete x_i and unblur y_i, and accordingly fix the segment between y_i 
and the dual of x_i by O(k) swaps. In the last case, if the blurred node y_i within k steps promotes 
to a different parent than x_i, then swap x_i with its dual to make it the second case. 

Theorem 10: The amortized message cost of each insertion and deletion in the deterministic skip 
graph is O(log^2 n). 

Proof: Each swap takes O(log n) messages since the two strings have O(log n) digits to exchange 
and exchanging a digit results in O(1) pointer changes in the data structure. To each insertion 
we assign O(log^2 n) credits to the inserted blurred node at the bottom level, O(log n) credits per 
undetermined digit in its membership string. For a deletion, at each level it takes O(k) swaps 
and blurs one node, to which we also assign O(log n) credits. Thus an insertion/deletion takes 
O(log^2 n) messages and each existing blurred node (bit) or undetermined digit carries O(log n) 
credits to pay for the future propagation of unblurrings/promotions. ■ 



5.3 Getting O(log n) Update Time in The Strong Rainbow Skip Graph 

A direct grouping of keys into supernodes and integration of hydra components into the above 
deterministic skip graph results in a rainbow skip graph with amortized O(log^2 n) update time, 
as mentioned in Theorem 7. To improve the update cost to O(log n), we propose a two-level 
grouping instead of the (one-level) grouping described in Section 2. Instead of grouping the keys 
into supernodes of size O(log n), we now group them into supernodes of size O(log^2 n), and further 
partition each supernode into a list of O(log n) sublists each of size O(log n), where the keys in the 
first sublist are all smaller than the keys in the second sublist, etc. This way the search/query time 
remains O(log n), because after a supernode containing the query key is located, to search for the 
key inside the supernode we only perform two levels of linear traversal of lists, each of size O(log n). 
By splitting and merging the sublists inside a supernode in the same way a supernode is split 
or merged, we perform a supernode insertion or deletion only once every O(log^2 n) 
insertions or deletions of keys, resulting in an update cost amortized to O(log n). To deploy in a 
distributed environment, we associate the i-th sublist of a supernode with the level-i representative 
of this supernode in the skip graph, and use any of the keys in the sublist as the representative. 
By maintaining O(log n)-sized hydra components, we can follow the same argument to see that the 
amortized update time after integrating hydra components is still O(log n). This two-level grouping 
scheme will bring one more factor of log n to the congestion because the number of keys associated 
with a supernode has increased by a factor of log n. Theoretically, however, this factor of log n is 
covered by the n^ε term in Theorem 9. Moreover, since now we have a sublist of O(log n) candidate 
representatives for each node in the skip graph and only one of them is used, in practice we are able 
to cancel this increase of congestion by rotating the representative among the O(log n) candidates 
after a search passes this node. Thus we achieve (deterministic) amortized O(log n) update cost, 
with no effect on the other features of the rainbow skip graph. 
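The two levels of linear traversal inside a supernode can be sketched as follows. The sketch assumes a supernode is stored locally as a list of sorted sublists with increasing key ranges, and it answers a successor query; the function name `successor_in_supernode` is hypothetical, and in the distributed setting each step would be a message rather than a list access.

```python
from typing import List, Optional


def successor_in_supernode(sublists: List[List[int]], key: int) -> Optional[int]:
    """Return the smallest stored key that is >= `key`, or None if the
    supernode holds no such key.

    The outer loop walks the O(log n) sublists, and the inner loop walks the
    O(log n) keys of the one sublist that can contain the answer.
    """
    for sublist in sublists:              # first-level linear traversal
        if sublist and sublist[-1] >= key:
            for item in sublist:          # second-level linear traversal
                if item >= key:
                    return item
    return None
```

Because only one sublist is scanned in full, the total work is proportional to the number of sublists plus the size of one sublist, which matches the O(log n) bound argued above.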

Theorem 11: The amortized message cost of insertion and well-behaved deletion, U(n), in a 
strong rainbow skip graph is O(log n). 